We present NusaCrowd, a collaborative initiative to collect and unite existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, we have brought together 137 datasets and 117 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their effectiveness has been demonstrated in multiple experiments. NusaCrowd's data collection enables the creation of the first zero-shot benchmarks for natural language understanding and generation in Indonesian and its local languages. Furthermore, NusaCrowd enables the creation of the first multilingual automatic speech recognition benchmark in Indonesian and its local languages. Our work is intended to help advance natural language processing research in under-represented languages.
translated by 谷歌翻译
Neglected tropical diseases (NTDs) continue to affect the livelihood of individuals in countries in the Southeast Asia and Western Pacific region. These diseases have long existed and have caused devastating health problems and economic decline for people in low- and middle-income (developing) countries. An estimated 1.7 billion of the world's population suffer from one or more NTDs annually, which puts approximately one in five individuals at risk of NTDs. In addition to their health and social impact, NTDs inflict a significant financial burden on patients and their close relatives, and are responsible for billions of dollars in lost revenue from reduced labor productivity in developing countries alone. There is an urgent need to improve control and eradication or elimination efforts against NTDs. This can be achieved by using machine learning tools to improve surveillance, prediction, and detection programs, and to combat NTDs through the discovery of new therapeutics against these pathogens. This review surveys the current application of machine learning tools to NTDs and the challenges in advancing the state of the art of NTD surveillance, management, and treatment.
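The apparent similarities between natural language and biological sequences have led to a recent surge in the application of deep language models (LMs) to the analysis of antibody and other biological sequences. However, the lack of a rigorous linguistic formalization of biological sequence languages, one that would define basic components such as the lexicon (i.e., the discrete units of the language) and the grammar (i.e., the rules that link sequence well-formedness, structure, and meaning), has led to largely domain-unspecific applications of LMs that do not take into account the underlying structure of the biological sequences studied. A linguistic formalization, on the other hand, establishes linguistically informed and thus domain-adapted components for LM applications. It would facilitate a better understanding of how the differences and similarities between natural language and biological sequences influence the quality of LMs, which is crucial for building interpretable models. Deciphering the rules of antibody specificity is crucial for accelerating rational and in silico biotherapeutic drug design. Here, we formalize the properties of the antibody language and thereby establish a foundation not only for the application of linguistic tools in adaptive immune receptor analysis but also for systematic immunolinguistic studies of immune receptor specificity.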
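Deep neural-network-based language models (LMs) are increasingly applied to large-scale protein sequence data to predict protein function. However, as black-box models, current protein LM approaches contribute little to a fundamental understanding of sequence-function mappings, which hinders rule-based biotherapeutic drug development. We argue that guidance drawn from linguistics, a field specialized in extracting analytical rules from natural language data, can help build more interpretable protein LMs that learn relevant domain-specific rules. The differences between protein sequence data and linguistic sequence data require the integration of more domain-specific knowledge into protein LMs than into natural language LMs. Here, we provide a linguistics-based roadmap for training data, tokenization, token embedding, sequence embedding, and model interpretation. Combining linguistics with protein LMs enables the development of next-generation interpretable machine learning models with the potential to uncover the biological mechanisms underlying sequence-function relationships.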
Using 3D CNNs on high-resolution medical volumes is very computationally demanding, especially for large datasets like the UK Biobank, which aims to scan 100,000 subjects. Here we demonstrate that using 2D CNNs on a few 2D projections (representing mean and standard deviation across axial, sagittal and coronal slices) of the 3D volumes leads to reasonable test accuracy when predicting age from brain volumes. Using our approach, one training epoch with 20,324 subjects takes 40 to 70 seconds on a single GPU, which is almost 100 times faster than a small 3D CNN. These results are important for researchers who do not have access to expensive GPU hardware for 3D CNNs.
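The projection step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `project` and `six_channel_input`, the nested-list volume layout `[z][y][x]`, and the use of the population standard deviation are all assumptions, and the mapping of array axes to axial, sagittal, and coronal directions depends on how a scan is stored.

```python
from statistics import fmean, pstdev

def project(volume, axis):
    """Collapse a 3D volume (nested lists indexed [z][y][x]) along one axis.

    Returns two 2D images: the per-pixel mean and the per-pixel
    (population) standard deviation of the voxels along that axis.
    """
    shape = (len(volume), len(volume[0]), len(volume[0][0]))
    # The two axes that remain after the projection.
    a, b = [i for i in range(3) if i != axis]
    mean_img, std_img = [], []
    for i in range(shape[a]):
        mean_row, std_row = [], []
        for j in range(shape[b]):
            values = []
            for k in range(shape[axis]):
                idx = [0, 0, 0]
                idx[a], idx[b], idx[axis] = i, j, k
                values.append(volume[idx[0]][idx[1]][idx[2]])
            mean_row.append(fmean(values))
            std_row.append(pstdev(values))
        mean_img.append(mean_row)
        std_img.append(std_row)
    return mean_img, std_img

def six_channel_input(volume):
    """Collect mean and std projections for all three axes (six 2D images)."""
    channels = []
    for axis in range(3):
        mean_img, std_img = project(volume, axis)
        channels.extend([mean_img, std_img])
    return channels

# Toy 2x2x2 volume standing in for a brain scan (hypothetical data).
toy = [[[1.0, 2.0], [3.0, 4.0]],
       [[5.0, 6.0], [7.0, 8.0]]]
channels = six_channel_input(toy)
```

Each volume is reduced to six 2D images, which is where the speed-up comes from: a 2D CNN sees a handful of images per subject instead of the full voxel grid. How the projections are combined as network input is not specified in the abstract.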
Large annotated datasets are required to train segmentation networks. In medical imaging, it is often difficult, time-consuming and expensive to create such datasets, and it may also be difficult to share these datasets with other researchers. Various AI models can now generate very realistic synthetic images, which can potentially be openly shared as they do not belong to specific persons. However, recent work has shown that using synthetic images for training deep networks often leads to worse performance compared to using real images. Here we demonstrate that using synthetic images and annotations from an ensemble of 10 GANs, instead of from a single GAN, increases the Dice score on real test images by 4.7% to 14.0% on specific classes.
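For readers unfamiliar with the metric reported above, the per-class Dice score is 2|A ∩ B| / (|A| + |B|), where A and B are the predicted and reference pixel sets for a class. The sketch below is illustrative only: the function name `dice_score`, the 2D label-map layout, and the convention of returning 1.0 when a class is absent from both maps are assumptions, not details taken from the paper.

```python
def dice_score(pred, target, label):
    """Dice = 2|A ∩ B| / (|A| + |B|) for one class label.

    `pred` and `target` are 2D label maps (lists of rows) of equal size.
    """
    pred_mask = [p == label for row in pred for p in row]
    target_mask = [t == label for row in target for t in row]
    intersection = sum(p and t for p, t in zip(pred_mask, target_mask))
    total = sum(pred_mask) + sum(target_mask)
    # Assumed convention: a class absent from both maps counts as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

# Hypothetical 2x2 segmentation and reference annotation.
pred = [[0, 1],
        [1, 1]]
target = [[0, 1],
          [0, 1]]
score_class_1 = dice_score(pred, target, label=1)
```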
Machine learning is the study of computer algorithms that can automatically improve based on data and experience. Machine learning algorithms build a model from sample data, called training data, to make predictions or judgments without being explicitly programmed to do so. A variety of well-known machine learning algorithms have been developed in computer science to analyze data. This paper introduces a new machine learning algorithm called impact learning. Impact learning is a supervised learning algorithm that can be applied to both classification and regression problems. It is particularly well suited to analyzing competitive data. The algorithm is notable for learning from competitive situations, where the competition arises from the effects of autonomous features. It is built on the impacts of features derived from the intrinsic rate of natural increase (RNI). We also demonstrate the advantage of impact learning over conventional machine learning algorithms.
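Smart meter measurements, although critical for accurate demand forecasting, face several drawbacks, including consumer privacy and data breach issues, to name a few. Recent literature has explored federated learning (FL) as a promising privacy-preserving machine learning alternative that enables a model to be learned collaboratively for short-term load forecasting without exposing private raw data. Despite its virtues, standard FL remains vulnerable to an intractable cyber threat known as the Byzantine attack, carried out by faulty and/or malicious clients. Therefore, to improve the robustness of federated short-term load forecasting against Byzantine threats, we develop a state-of-the-art private and secure FL-based framework that ensures the privacy of individual smart meters' data while protecting the security of FL models and architecture. Our proposed framework leverages the idea of gradient quantization through the sign stochastic gradient descent (SignSGD) algorithm, in which clients transmit only the "sign" of the gradient to the control center after local model training. As we highlight through experiments involving benchmark neural networks and a set of Byzantine attack models, our proposed approach mitigates such threats quite effectively and thus outperforms conventional Fed-SGD models.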
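The sign-only communication described above can be illustrated with a minimal sketch. The abstract states only that clients send the sign of their gradient to the control center; aggregating those signs by element-wise majority vote, as in the SignSGD literature, is an assumption here, as are the function names, the learning rate, and the toy gradients.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def client_message(gradient):
    """What a client transmits: one sign per gradient coordinate."""
    return [sign(g) for g in gradient]

def majority_vote(messages):
    """Element-wise majority vote over the clients' sign vectors (assumed rule)."""
    return [sign(sum(column)) for column in zip(*messages)]

def server_step(weights, messages, lr=0.1):
    """Move each weight by a fixed step against the voted sign."""
    vote = majority_vote(messages)
    return [w - lr * v for w, v in zip(weights, vote)]

# Three honest clients and one Byzantine client that sends a huge,
# flipped gradient (all values are hypothetical).
honest = [[0.5, -2.0], [1.5, -0.1], [0.2, -0.7]]
byzantine = [-1000.0, 1000.0]
messages = [client_message(g) for g in honest + [byzantine]]
new_weights = server_step([0.0, 0.0], messages, lr=0.1)
```

Because every client contributes at most one vote per coordinate, the magnitude of the Byzantine gradient has no effect. Averaging the raw gradients in this toy case would instead be dominated by the malicious values.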
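At the center of the underlying issues that halt progress in Indonesian natural language processing (NLP) research, we find data scarcity. Resources for Indonesian languages, especially the local ones, are extremely scarce and underrepresented. Many Indonesian researchers do not publish their datasets. Furthermore, the few public datasets that we have are scattered across different platforms, which makes reproducible and data-centric research in Indonesian NLP even more arduous. Rising to this challenge, we initiate the first Indonesian NLP crowdsourcing effort, NusaCrowd. NusaCrowd strives to provide the largest aggregation of datasheets, with standardized data loading, for NLP tasks in all Indonesian languages. By enabling open and centralized access to Indonesian NLP resources, we hope NusaCrowd can tackle the data scarcity problem hindering NLP progress in Indonesia and bring NLP practitioners together to collaborate.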
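Since early 2020, COVID-19 has had a significant impact globally. It has brought much confusion to society, especially due to the spread of misinformation through social media. Although several studies have addressed the detection of misinformation in social media data, most have focused on English datasets. Research on COVID-19 misinformation detection in Indonesian is still scarce. Therefore, in this study, we collect and annotate a dataset in Indonesian and build prediction models for detecting COVID-19 misinformation that take the relevance of the tweet into account. The dataset was constructed by a team of annotators who labeled the tweet data for relevance and misinformation. We propose a two-stage classifier model using the IndoBERT pre-trained language model for the tweet misinformation detection task. We also experiment with several other baseline models for text classification. The experimental results show that the combination of a BERT sequence classifier for relevance prediction and a Bi-LSTM for misinformation detection outperforms other machine learning models, with an accuracy of 87.02%. Overall, the use of BERT contributes to the higher performance of most prediction models. We release a high-quality COVID-19 misinformation tweet corpus, as indicated by its high inter-annotator agreement.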